About the project

Introduction to Open Data Science is a course organized by the Doctoral School in Humanities and Social Sciences (HYMY) at the University of Helsinki. As mentioned on the course page, “The name of this course refers to THREE BIG THEMES: 1) Open Data, 2) Open Science, and 3) Data Science”. The learning platform for this course is the University of Helsinki MOOC (Massive Open Online Course) platform: http://mooc.helsinki.fi. More information about the course can be found at this link: https://courses.helsinki.fi/en/hymy005/120776718.


My GitHub repository for the IODS-project

GitHub


Tools and Methods for open and reproducible research

In the first week of the course, I went through the general instructions and familiarized myself with the learning tools, the course content and the schedule. Although I had previous experience with R and RStudio, I completed all the exercises in the DataCamp course R Short and Sweet to refresh my R skills; my statement of accomplishment is published here. The other exercise I completed this week is RStudio Exercise 1. I already had a GitHub account, so to follow the exercise instructions I forked the IODS-project repository from Tuomo Nieminen’s GitHub. After forking, I cloned the repository to my local machine with the command git clone. Using RStudio, I then edited the existing R Markdown files in the repository (chapter1.Rmd, chapter2.Rmd and index.Rmd) and added a new R Markdown file saved as README.Rmd. Next, I committed the local changes with git commit -m and pushed them to GitHub with git push. I also changed the theme of my course diary web page to the Merlot theme. Through these exercises I refreshed my Git and GitHub knowledge and learned how I will use GitHub during this course to share and publish my learning diary.


Regression and Model validation

Data wrangling

In this section, the data set given in this link has been preprocessed for downstream analysis. After creating a folder named ‘data’ in my IODS-project repository, I created a new R script named ‘create_learning2014’ in RStudio, wrote all the steps of the data wrangling process into it, and saved it in the ‘data’ folder. All the data wrangling steps can be found here.
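The core of that wrangling script is combining groups of Likert-scale questions into mean scores (deep, stra, surf) and dropping respondents with zero exam points. A minimal sketch of that logic on a synthetic data frame (the column names below are illustrative, not the actual survey question codes):

```r
# Sketch of the wrangling logic on synthetic data; the real script reads the
# survey file and uses the actual question codes for each learning dimension.
set.seed(1)
survey <- data.frame(
  D03 = sample(1:5, 6, replace = TRUE),   # hypothetical "deep" questions
  D11 = sample(1:5, 6, replace = TRUE),
  ST01 = sample(1:5, 6, replace = TRUE),  # hypothetical "strategic" questions
  ST09 = sample(1:5, 6, replace = TRUE),
  Points = c(25, 0, 24, 10, 0, 21)
)

# Average each group of questions into a single scale score
survey$deep <- rowMeans(survey[, c("D03", "D11")])
survey$stra <- rowMeans(survey[, c("ST01", "ST09")])

# Keep only respondents with a non-zero exam score
learning <- subset(survey, Points > 0)
nrow(learning)  # 4 rows remain
```

The real script does the same with the actual question groupings before writing learning2014.txt.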

Analysis

The data for this section were extracted from a survey on teaching and learning approaches conducted by Kimmo Vehkalahti. The research project was made possible by Teachers’ Academy funding (2013-2015). The data were collected between 3.12.2014 and 10.1.2015. The survey includes 183 respondents, and the questionnaire comprises 60 variables/questions in total.

students2014=read.table("data/learning2014.txt", sep="\t", header=TRUE)
dim(students2014)
## [1] 166   7
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
summary(students2014)
##  gender       Age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

After preprocessing the raw data, the final data set has 7 columns (gender, age, attitude, deep, stra, surf and points) and 166 individuals, as shown in the output above.

library(GGally)
library(ggplot2)

plot_students2014 <- ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

plot_students2014

The plot above shows that there are 110 female and 56 male respondents in the survey. It also shows how the variables (Points, Age, attitude, deep, stra and surf) are correlated. Based on the absolute correlation values, attitude and Points are the most highly correlated, while deep and Points are the least correlated.

The explanatory variables were chosen based on their (absolute) correlation with Points; the top three are attitude (0.437), stra (0.146) and surf (-0.144). Using these variables, the model is built with the R function lm().

my_model1 <- lm(Points ~ attitude + stra + surf, data = students2014)

summary(my_model1)
## 
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

In the summary table above, the significant parameters are the intercept (p = 0.00322) and attitude (p = 1.93e-08). Hence stra and surf are not considered further, and a linear regression model between Points and attitude alone is fitted again with lm(); the reduced model is summarized below. The multiple R-squared drops slightly from 0.2074 to 0.1906 (adjusted R-squared from 0.1927 to 0.1856), which is expected since removing predictors can never increase the multiple R-squared. However, the F-statistic (14.13 to 38.61) and its p-value (3.156e-08 to 4.119e-09) improved markedly. Thus, the R-squared value alone does not determine the quality of a model.

my_model2 <- lm(Points ~ attitude , data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Model Validation

par(mfrow = c(2,2))

plot(my_model2, which=c(1,2,5))

The three diagnostic model validation plots are shown above. The assumptions of the linear regression model are:

  • The errors are normally distributed
  • The errors are not correlated
  • The errors have constant variance
  • The size of a given error does not depend on the explanatory variables

Based on the plots above, we can conclude that the errors are approximately normally distributed (the points follow the line in the Q-Q plot). The residuals versus fitted values plot shows no clear pattern, so the errors do not depend on the attitude variable and have reasonably constant variance. In the residuals versus leverage plot, even the two points towards the right have only minor influence on the model. No single outlier dominates the fit, so the model assumptions are more or less valid.
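Beyond the visual diagnostics, residual normality can also be probed numerically. A minimal sketch on simulated data (the Shapiro-Wilk test is my addition, not part of the course analysis; the coefficients 11 and 3.5 loosely mimic the fitted model):

```r
# Simulate data that satisfies the linear-model assumptions, then inspect
# the residuals numerically (illustrative; not the course data)
set.seed(42)
x <- runif(100, 1, 5)
y <- 11 + 3.5 * x + rnorm(100, sd = 5)  # roughly mimics the Points ~ attitude fit
fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals: a large p-value is consistent with
# normally distributed errors
shapiro.test(residuals(fit))
```

On the actual data the same call would be shapiro.test(residuals(my_model2)).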


Logistic regression

Data wrangling

library(GGally)
library(ggplot2)
library(ggpubr)
## Loading required package: magrittr
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)
library(gmodels)
library(boot)

In this section, data wrangling has been done for two data sets retrieved from the UCI Machine Learning Repository. The R script used for the data wrangling can be found here.

Analysis

The data sets used in this analysis were retrieved from the UCI Machine Learning Repository. They describe student achievement in two Portuguese secondary schools. The data were collected using school reports and questionnaires and include attributes such as student grades and demographic, social and school-related features. Two data sets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por) (Cortez et al. 2008). For this analysis, the two original data sets were joined, keeping only the students who answered the questionnaire in both subjects. The final data set includes 382 observations and 35 attributes.

modified_data= read.table("data/modified_data.txt", sep="\t", header=TRUE)

colnames(modified_data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

In order to study the relationship between high/low alcohol consumption and the other variables, I chose four variables (“absences”, “sex”, “goout” and “studytime”). My hypotheses about how each variable is related to alcohol consumption are shown below:

  • studytime: Students who dedicate a lot of time to their studies do not have time to go out drinking (studytime and high alcohol consumption are negatively correlated)

  • sex: Male students have higher alcohol consumption than female students (being male and high alcohol consumption are positively correlated)

  • goout: Students who go out with friends frequently consume more alcohol than students who do not (goout and high alcohol consumption are positively correlated)

  • absences: Students who consume more alcohol miss class for various reasons (e.g. hangovers, attending parties during class time) (absences and high alcohol consumption are positively correlated)


The distributions of my chosen variables and their relationships with alcohol consumption


A bar plot demonstrating the distributions of my chosen variables

my_variables= c("absences", "sex", "goout", "studytime", "high_use")

my_variables_data <- select(modified_data, one_of(my_variables))

colnames(my_variables_data)
## [1] "absences"  "sex"       "goout"     "studytime" "high_use"
gather(my_variables_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped


A box plot demonstrating my chosen variables and their relationships with alcohol consumption

g1 <- ggplot(modified_data, aes(x = high_use, y = absences,col=sex))

p1=g1 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

g2 <- ggplot(modified_data, aes(x = high_use, y = goout, col=sex))

p2=g2 + geom_boxplot() + ylab("going out with friends")  + ggtitle("Students going out with friends by alcohol consumption and sex") 

g3 <- ggplot(modified_data, aes(x = high_use, y = studytime, col=sex))

p3=g3 + geom_boxplot() + ylab("studytime - weekly study time")  + ggtitle("Student study time by alcohol consumption and sex") 

ggarrange(p1, p2, p3 , labels = c("A", "B","C"), ncol = 2, nrow = 2)

#sex_high_use <- CrossTable(my_variables_data$sex, my_variables_data$high_use)
#goout_high_use <- CrossTable(my_variables_data$goout, my_variables_data$high_use)
#goout_high_use <- CrossTable(my_variables_data$goout, my_variables_data$high_use)
#studytime_high_use <- CrossTable(my_variables_data$studytime, my_variables_data$high_use)
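The commented-out CrossTable calls above can also be expressed with base R; a minimal sketch on synthetic vectors (illustrative, not the survey data):

```r
# Cross-tabulate a binary outcome against a grouping variable, then
# convert the counts to proportions within each group
sex <- c("F", "F", "M", "M", "M", "F")
high_use <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)

counts <- table(sex = sex, high_use = high_use)
props <- prop.table(counts, margin = 1)  # proportions within each sex
counts
props
```

On the real data the same pattern is table(my_variables_data$sex, my_variables_data$high_use).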


To statistically explore the relationship between my chosen variables and the high/low alcohol consumption variable, a logistic regression model was built using the R function glm().

m <- glm(high_use ~  absences + sex +  goout + studytime , data = modified_data, family = "binomial")


In order to be able to interpret the fitted model in detail, the summary, coefficients and confidence intervals of the fitted model are calculated as shown below.

summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + sex + goout + studytime, 
##     family = "binomial", data = modified_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9103  -0.7892  -0.5021   0.7574   2.6021  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.20483    0.59429  -5.393 6.94e-08 ***
## absences     0.07811    0.02244   3.481 0.000499 ***
## sexM         0.78173    0.26555   2.944 0.003242 ** 
## goout        0.72677    0.11994   6.059 1.37e-09 ***
## studytime   -0.42116    0.17058  -2.469 0.013551 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 381.31  on 377  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    absences        sexM       goout   studytime 
## -3.20483157  0.07810746  0.78173290  0.72676824 -0.42116184
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## absences    1.08123884 1.03577383 1.1324143
## sexM        2.18525582 1.30370562 3.7010763
## goout       2.06838526 1.64470314 2.6349716
## studytime   0.65628388 0.46510383 0.9099505


As shown above, all four variables are statistically significant. goout has the lowest p-value (1.37e-09), suggesting a strong association between going out with friends and high alcohol consumption. In the logit model, the response variable is the log odds: \(ln(odds) = ln(p/(1-p)) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n\). The positive coefficient for goout means that, all other variables being equal, a one-unit increase in going out with friends increases the log odds of high alcohol consumption by 0.72677. absences (p = 0.000499) and sex (p = 0.003242) are also significant. The positive coefficient for absences means that, other variables being equal, a one-unit increase in absences increases the log odds by 0.07811. The positive coefficient for sexM means that male students are more likely to be high alcohol consumers; being male increases the log odds by 0.78173. studytime is also significant, and its negative coefficient means that, other variables being equal, a one-unit increase in studytime reduces the log odds by 0.42116.

Odds are defined as the ratio of expected “successes” to “failures”: p/(1-p). They are an alternative way of expressing probabilities, and an odds ratio greater than 1 corresponds to a higher probability of success. The exponents of the coefficients of a logistic regression model can be interpreted as odds ratios between a unit change (versus no change) in the corresponding explanatory variable. The exponentiated goout coefficient is 2.068, meaning that with all other variables held equal, a one-unit increase in goout multiplies the odds of high alcohol consumption by about 2.07. Likewise, the odds ratio for sexM is 2.185 and for absences 1.081, so being male and each additional absence multiply the odds by about 2.19 and 1.08, respectively. The odds ratio for studytime is 0.656, which means a one-unit increase in studytime decreases the odds of high alcohol consumption by roughly a third.
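The link between a coefficient and its odds ratio is just exponentiation; a quick numerical check using the coefficients reported in the summary above:

```r
# Odds ratio = exp(log-odds coefficient); probabilities via plogis()
beta_goout <- 0.72677   # goout coefficient from the summary above
exp(beta_goout)         # odds ratio, approximately 2.068

# Converting log odds back to a probability: e.g. the intercept alone
alpha <- -3.20483
plogis(alpha)           # baseline probability, approximately 0.039
```

plogis() is the standard logistic function 1/(1 + exp(-x)), i.e. the inverse of the logit.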

Predictive power of the model
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'modified_data'

modified_data <- mutate(modified_data, probability = probabilities)

# use the probabilities to make a prediction of high_use

modified_data <- mutate(modified_data, prediction = probability>0.5)

table(high_use = modified_data$high_use, prediction = modified_data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   16
##    TRUE     65   49
# the logistic regression model m and modified_data (with predictions) are available

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(modified_data$high_use, modified_data$probability)
## [1] 0.2120419

The 2x2 cross tabulation above indicates that, of the 382 observations, 81 (65 + 16) were wrongly predicted. The average proportion of incorrect predictions on the training data is about 21%, which is better than the error of the model in the DataCamp exercises.


Bonus

# K-fold cross-validation

cv <- cv.glm(data = modified_data, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2094241

The 10-fold cross-validation error is 0.2094241, which is very close to the training error computed from the 2x2 cross tabulation above (0.2120419). Cross-validation therefore does not reveal an obviously smaller prediction error for this model.


Clustering and classification

#install.packages("corrplot")    # install the corrplot package if needed
#install.packages("kableExtra")  # install the kableExtra package if needed
library(GGally)
library(ggplot2)
library(tidyr)
library(dplyr)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.84 loaded
library(ggpubr)
library(magrittr)
library(boot)
library(knitr)
library(kableExtra)
#library("DT")
# load the data
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The default installation of R comes with many (usually small) data sets. The Boston data set we use in this week's exercise comes from the MASS package. The data were originally published by (Harrison et al. 1978), who studied the Boston house-price data, and were later republished by (Belsley et al. 1980). The Boston data set has 506 observations and 14 variables. Details about the data set can be found at these two links: [1] and [2].


Boston summary table

Boston_summary= do.call(cbind, lapply(Boston, summary))

kable(Boston_summary,"html", caption="Boston summary table") %>%   kable_styling(bootstrap_options ="condensed",  font_size = 13, full_width = F, position = "left") %>% column_spec(1:1, bold = T,color = "white", background = "green") %>% row_spec(0, bold = T,  color = "white", background = "green")
|         | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---------|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|-------|-------|------|
| Min.    | 0.006320 | 0.00000 | 0.46000 | 0.00000 | 0.3850000 | 3.561000 | 2.9000 | 1.129600 | 1.000000 | 187.0000 | 12.60000 | 0.3200 | 1.73000 | 5.00000 |
| 1st Qu. | 0.082045 | 0.00000 | 5.19000 | 0.00000 | 0.4490000 | 5.885500 | 45.0250 | 2.100175 | 4.000000 | 279.0000 | 17.40000 | 375.3775 | 6.95000 | 17.02500 |
| Median  | 0.256510 | 0.00000 | 9.69000 | 0.00000 | 0.5380000 | 6.208500 | 77.5000 | 3.207450 | 5.000000 | 330.0000 | 19.05000 | 391.4400 | 11.36000 | 21.20000 |
| Mean    | 3.613524 | 11.36364 | 11.13678 | 0.06917 | 0.5546951 | 6.284634 | 68.5749 | 3.795043 | 9.549407 | 408.2372 | 18.45553 | 356.6740 | 12.65306 | 22.53281 |
| 3rd Qu. | 3.677083 | 12.50000 | 18.10000 | 0.00000 | 0.6240000 | 6.623500 | 94.0750 | 5.188425 | 24.000000 | 666.0000 | 20.20000 | 396.2250 | 16.95500 | 25.00000 |
| Max.    | 88.976200 | 100.00000 | 27.74000 | 1.00000 | 0.8710000 | 8.780000 | 100.0000 | 12.126500 | 24.000000 | 711.0000 | 22.00000 | 396.9000 | 37.97000 | 50.00000 |

Visualizing the scatter plot matrix and examining the correlation between the Boston data set variables


p1=ggpairs(Boston,title="scatter plot matrix with correlation") 
p1 + theme(plot.title = element_text(size = rel(2)))


cor_matrix<-cor(Boston) %>% round(digits=2)

# visualize the correlation matrix

corrplot(cor_matrix, method="circle", type = "lower", cl.pos = "b", tl.col="black", tl.pos = "d", tl.cex = 1.2,title="Correlations plot",mar=c(0,0,1,0))


The scatter plot matrix and correlation plot above describe the distributions of the variables and the relationships between them. In the scatter plot matrix, the diagonal cells show histograms of each variable, the lower panel displays a scatter plot for each pair of variables, and the upper panel displays the correlation value for each pair. The correlation plot is a colored representation of the Boston correlation matrix, where the values are mapped to colors ranging from red to blue (-1 to 1), with white in the middle at zero. For example, the correlation plot shows the relation between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax) as a high-intensity blue circle, and the scatter plot matrix displays their correlation value of 0.91; as the accessibility index increases, the property-tax rate also increases, and vice versa. Conversely, the correlation plot displays a high-intensity red circle for the nitrogen oxides concentration (nox, parts per 10 million) and the weighted mean of distances to five Boston employment centres (dis), with a correlation value of -0.77; as nox increases, dis decreases, and vice versa.
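The two strongest pairs read off the plot can be confirmed directly from the correlation matrix:

```r
library(MASS)  # provides the Boston data set

data("Boston")
cor_matrix <- round(cor(Boston), digits = 2)

cor_matrix["rad", "tax"]  # 0.91, the strongest positive pair
cor_matrix["nox", "dis"]  # -0.77, the strongest negative pair
```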

boston_scaled <- scale(Boston)


boston_scaled<-as.data.frame(boston_scaled)


Boston scaled summary table

Boston_summary_scaled= do.call(cbind, lapply(boston_scaled, summary))

kable(Boston_summary_scaled,"html") %>%   kable_styling(bootstrap_options ="condensed",  font_size = 13, full_width = F, position = "left") %>% column_spec(1:1, bold = T,color = "white", background = "green") %>% row_spec(0, bold = T,  color = "white", background = "green")
|         | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---------|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|-------|-------|------|
| Min.    | -0.4193669 | -0.4872402 | -1.5563017 | -0.2723291 | -1.4644327 | -3.8764132 | -2.3331282 | -1.2658165 | -0.9818712 | -1.3126910 | -2.7047025 | -3.9033305 | -1.5296134 | -1.9063399 |
| 1st Qu. | -0.4105633 | -0.4872402 | -0.8668328 | -0.2723291 | -0.9121262 | -0.5680681 | -0.8366200 | -0.8048913 | -0.6373311 | -0.7668172 | -0.4875567 | 0.2048688 | -0.7986296 | -0.5988631 |
| Median  | -0.3902803 | -0.4872402 | -0.2108898 | -0.2723291 | -0.1440749 | -0.1083583 | 0.3170678 | -0.2790473 | -0.5224844 | -0.4642132 | 0.2745872 | 0.3808097 | -0.1810744 | -0.1449159 |
| Mean    | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| 3rd Qu. | 0.0073892 | 0.0487240 | 1.0149946 | -0.2723291 | 0.5980871 | 0.4822906 | 0.9059016 | 0.6617161 | 1.6596029 | 1.5294129 | 0.8057784 | 0.4332223 | 0.6024226 | 0.2682577 |
| Max.    | 9.9241096 | 3.8004735 | 2.4201701 | 3.6647712 | 2.7296452 | 3.5515296 | 1.1163897 | 3.9566022 | 1.6596029 | 1.7964164 | 1.6372081 | 0.4406159 | 3.5452624 | 2.9865046 |


One of the main goals of standardizing the data is to be able to compare variables measured on different scales. The Boston data set is scaled using the R function scale(). From the summary table of the scaled data above, we can clearly see that each variable now has a mean of 0.

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
pairs(boston_scaled[1:13], main = "Scaled scatter plot matrix",
      pch = 21, bg = c("red", "green3", "blue","yellow")[boston_scaled$crime],
      oma=c(4,4,6,12),upper.panel = NULL)
par(xpd=TRUE)
legend(0.85, 0.7, as.vector(unique(boston_scaled$crime)),  
       fill=c("red", "green3", "blue","yellow"))


Scaling subtracts each variable's mean from it and divides by its standard deviation, so every scaled variable has mean 0 and standard deviation 1. Because every Boston variable is standardized with the formula \(\frac{X-\mu}{\sigma}\), the individual values change, and we can clearly see a big difference between the scaled and the original Boston data set: comparing the two summary tables, the minimum, maximum and quartile values are no longer the same. However, the correlation values are unchanged by scaling.
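This is easy to verify on a single variable; the manual \((x-\mu)/\sigma\) computation matches scale(), and the correlation structure is untouched:

```r
library(MASS)  # provides the Boston data set

data("Boston")
x <- Boston$crim

# Manual standardization matches scale()
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE

# Scaling does not change correlations
all.equal(cor(Boston$rad, Boston$tax),
          cor(scale(Boston)[, "rad"], scale(Boston)[, "tax"]))  # TRUE
```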

n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]

test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# fit the linear discriminant analysis on the train set
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2574257 0.2376238 0.2549505 
## 
## Group means:
##                  zn      indus         chas        nox         rm
## low       1.0181294 -0.9012680 -0.077423115 -0.8813836  0.4489710
## med_low  -0.1128898 -0.2642062 -0.007331936 -0.5461982 -0.1613744
## med_high -0.3791541  0.2102954  0.137785540  0.4432902  0.1252475
## high     -0.4872402  1.0170891 -0.119431971  1.1113501 -0.4204577
##                 age        dis        rad        tax     ptratio
## low      -0.9252313  0.8892731 -0.6850891 -0.7303942 -0.45279942
## med_low  -0.3214673  0.3026019 -0.5589261 -0.4739120 -0.01721261
## med_high  0.4058074 -0.4059987 -0.3896929 -0.2721808 -0.33599395
## high      0.8317035 -0.8663562  1.6384176  1.5142626  0.78111358
##                black       lstat        medv
## low       0.37628946 -0.77110785  0.54514207
## med_low   0.31374782 -0.09407750 -0.01172177
## med_high  0.09925633 -0.03119295  0.19022133
## high     -0.74429747  0.93451675 -0.76974319
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.12403207  0.74000039 -0.83966506
## indus   -0.01027535 -0.19459974  0.24040801
## chas    -0.08453630 -0.01094597  0.10835703
## nox      0.43782585 -0.73908512 -1.41398724
## rm      -0.07613708 -0.03224923 -0.15778131
## age      0.29782982 -0.30112429 -0.13679802
## dis     -0.05591067 -0.20652286  0.02265899
## rad      3.03298093  0.98888906 -0.11246801
## tax     -0.06864096 -0.08774866  0.54080492
## ptratio  0.08890535  0.03164932 -0.15445087
## black   -0.14261608  0.02852735  0.10694737
## lstat    0.18747565 -0.13614712  0.50633296
## medv     0.12505597 -0.40798302 -0.10264502
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9426 0.0420 0.0154


The prior probabilities above show the proportion of each group (“low, med_low, med_high, high”) in the training data; each group makes up approximately 25% of the observations. The group means display the mean value of each variable within each group, showing how the groups differ from one another. The coefficients of linear discriminants are the weights of each variable in the linear combinations that form the discriminants. For example, the first discriminant function is the linear combination \(0.124*zn - 0.010*indus - 0.085*chas + 0.438*nox + \dots + 0.187*lstat + 0.125*medv\). The proportion of trace gives the share of the between-group variance explained by each discriminant; in my analysis, the first discriminant explains about 94% of the between-group variance in the Boston data set.
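The proportion-of-trace numbers can be reproduced from the fitted object's singular values. A self-contained sketch on the built-in iris data (used here instead of the train set, since the train/test split above is random):

```r
library(MASS)

# Fit an LDA with two discriminants on a built-in data set
fit <- lda(Species ~ ., data = iris)

# Proportion of trace = squared singular values, normalized to sum to 1
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)  # matches the "Proportion of trace" line of print(fit)
```

The same computation on lda.fit would reproduce the 0.9426 / 0.0420 / 0.0154 line above.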


lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes)
lda.arrows(lda.fit, myscale = 1.4)


In the biplot above, the arrow for the index of accessibility to radial highways (rad) is by far the longest, i.e. rad has the largest LD1 coefficient and contributes most to separating the crime-rate classes.

lda.pred <- predict(lda.fit, newdata = test)

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      14        1    0
##   med_low    5      17        0    0
##   med_high   1      13       16    0
##   high       0       0        0   24

Every time I run the code, the cross-tabulation values differ because the train and test data are sampled randomly. Nevertheless, in the cross-tabulation above my model predicted about two thirds of the test observations correctly (68 of 102, roughly 67%), so its misclassification rate is approximately 33%.
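Rather than eyeballing the table, the accuracy can be computed from its diagonal; a sketch using the counts from the cross-tabulation printed above:

```r
# Confusion matrix copied from the cross-tabulation above (rows = correct class)
conf <- matrix(c(11, 14,  1,  0,
                  5, 17,  0,  0,
                  1, 13, 16,  0,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(conf)) / sum(conf)  # correct predictions on the diagonal
round(accuracy, 2)  # about 0.67
1 - accuracy        # misclassification rate, about 0.33
```

With the live objects the same computation is sum(diag(table(correct_classes, lda.pred$class))) / length(correct_classes).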

data('Boston')

boston_scaled2 <- scale(Boston)

dist_eu <- dist(boston_scaled2)

set.seed(12345)

k_max <- 13

twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

#qplot(x = 1:k_max, y = twcss, geom =c("point", "line"),span = 0.2)

# collect the number of clusters and the corresponding total WCSS for plotting
b <- 1:k_max
aa <- data.frame(b, twcss)

ggplot(data=aa, aes(x=b, y=twcss, group=1)) +
  geom_line(color="red")+
  geom_point() + ggtitle("Optimal Number of Clusters")


One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. With the elbow method, each successive increase in the number of clusters brings a drop in the total WCSS, and the optimal number of clusters is found where the line plot shows the last radical drop. From the line plot above we can see that after 2 clusters the decrease in within-cluster dissimilarity is no longer radical. Consequently, we can say with reasonable confidence that the optimal number of clusters is 2.


Taking this optimal number of clusters as valid, the K-means algorithm is run again and the results are plotted below:


km <-kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2, main = "K-means clustering",col = km$cluster, upper.panel = NULL)
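Besides the pairs plot, the fitted kmeans object itself reports the cluster sizes, centres and WCSS; a quick illustration on the built-in iris measurements (the km object above is analogous):

```r
# Illustrative sketch: inspecting a k-means fit on built-in data
set.seed(12345)
km_demo <- kmeans(scale(iris[, 1:4]), centers = 2)

km_demo$size           # number of observations in each cluster
km_demo$centers        # cluster centres in the scaled variable space
km_demo$tot.withinss   # total within-cluster sum of squares
```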

Bonus

boston_scaled_bonus<-as.data.frame(scale(Boston))
set.seed(12345)
km_bonus<-kmeans(dist_eu, centers = 4)

myclust<-data.frame(km_bonus$cluster)
boston_scaled_bonus$clust<-km_bonus$cluster
lda.fit_bonus<-lda(clust~., data = boston_scaled_bonus)
#lda.fit_bonus


# same arrow helper as above, redefined for the bonus biplot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)  # pos must be an integer in 1:4
}


# plot the lda results, coloring points by their k-means cluster
plot(lda.fit_bonus, dimen = 2, col = boston_scaled_bonus$clust)
lda.arrows(lda.fit_bonus, myscale = 2.6)

In the biplot above, when the four k-means clusters are used as the target classes, the arrow for the proportion of non-retail business acres per town (indus) is the longest, i.e. indus contributes most to separating the clusters.


Dimensionality reduction techniques

#install.packages("FactoMineR")
library(GGally)
library(ggplot2)
library(tidyr)
library(dplyr)
library(corrplot)
library(ggpubr)
library(magrittr)
library(boot)
library(knitr)
library(kableExtra)
library(FactoMineR)

The Human dataset

human<-read.table("data/human.csv")
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8


This week I am dealing with the "human" data, which was generated by combining two datasets, "Human development" and "Gender inequality". Both originate from the United Nations Development Programme (UNDP); additional information about the data can be found in [1] and [2]. Before combining the two datasets, the "Gender inequality" data was mutated by creating two new variables. The first is the ratio of females to males with secondary education in each country (i.e. edu2F / edu2M). The second is the ratio of female to male labour force participation in each country (i.e. labF / labM). The two datasets were then joined using the variable Country as the identifier. After combining, the data was further modified by excluding unneeded variables, removing all rows with missing values and keeping only the observations that relate to countries. The final human dataset has 155 observations and 8 variables, all of which are continuous.
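The mutate-and-join steps described above can be sketched in base R with toy data (the country names, values and column names such as Edu2.F are made up for illustration; the actual wrangling script used dplyr):

```r
# Hypothetical miniature versions of the two UNDP datasets
gii <- data.frame(Country = c("Albania", "Belgium"),
                  Edu2.F = c(50, 60), Edu2.M = c(55, 62),
                  Labo.F = c(45, 50), Labo.M = c(65, 58))
hd  <- data.frame(Country = c("Albania", "Belgium"),
                  Edu.Exp = c(11.8, 16.3))

# mutate: create the two new ratio variables
gii$Edu2.FM <- gii$Edu2.F / gii$Edu2.M
gii$Labo.FM <- gii$Labo.F / gii$Labo.M

# join the two datasets by the Country identifier
human_demo <- merge(gii, hd, by = "Country")
```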


human_summary= do.call(cbind, lapply(human, summary))

kable(human_summary,"html", caption="human data summary table") %>%   kable_styling(bootstrap_options ="condensed",  font_size = 14, full_width = F) %>% column_spec(1:1, bold = T,color = "white", background = "green") %>% row_spec(0, bold = T,  color = "white", background = "green")
human data summary table

|         |   Edu2.FM |   Labo.FM |  Edu.Exp | Life.Exp |      GNI |   Mat.Mor | Ado.Birth |  Parli.F |
|---------|-----------|-----------|----------|----------|----------|-----------|-----------|----------|
| Min.    | 0.1717172 | 0.1856946 |  5.40000 | 49.00000 |    581.0 |    1.0000 |   0.60000 |  0.00000 |
| 1st Qu. | 0.7264088 | 0.5984174 | 11.25000 | 66.30000 |   4197.5 |   11.5000 |  12.65000 | 12.40000 |
| Median  | 0.9375000 | 0.7534669 | 13.50000 | 74.20000 |  12040.0 |   49.0000 |  33.60000 | 19.30000 |
| Mean    | 0.8528640 | 0.7074314 | 13.17613 | 71.65355 |  17627.9 |  149.0839 |  47.15871 | 20.91161 |
| 3rd Qu. | 0.9968404 | 0.8535430 | 15.20000 | 77.25000 |  24512.0 |  190.0000 |  71.95000 | 27.95000 |
| Max.    | 1.4967320 | 1.0380368 | 20.20000 | 83.50000 | 123124.0 | 1100.0000 | 204.80000 | 57.50000 |


Visualizing the scatter plot matrix and examining the correlation between the human data variables

p1=ggpairs(human,title="scatter plot matrix with correlation") 
p1 + theme(plot.title = element_text(size = rel(2)))


The scatter plot matrix above indicates that some of the variables are skewed: for example, GNI, Mat.Mor and Ado.Birth are positively skewed, whereas Life.Exp and Labo.FM are negatively skewed. From the scatter plot matrix we can also see that Edu2.FM and Edu.Exp are approximately normally distributed.
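These skewness claims could also be checked numerically; a minimal base-R helper (the moment coefficient of skewness), demonstrated on simulated samples rather than the human data:

```r
# Moment coefficient of skewness: positive for right-skewed data,
# negative for left-skewed, near zero for symmetric data.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
skewness(rexp(1000))    # exponential sample: clearly positive
skewness(rnorm(1000))   # normal sample: close to zero
```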


cor_matrix<-cor(human) %>% round(digits=2)

# visualize the correlation matrix

corrplot(cor_matrix, method="circle", type = "lower", cl.pos = "b", tl.col="black", tl.pos = "d", tl.cex = 1.2,title="Correlations plot",mar=c(0,0,1,0))

The correlation plot above shows the relationships between the variables as colored circles: the correlation values range from -1 (red) to 1 (blue), with white marking zero. For example, the adolescent birth rate (Ado.Birth) is positively correlated with the maternal mortality ratio (Mat.Mor), shown as an intense blue circle (correlation 0.759), but negatively correlated (-0.857) with life expectancy at birth (Life.Exp). Similarly, the ratio of females to males with secondary education (Edu2.FM) and the expected years of schooling (Edu.Exp) are both positively correlated with life expectancy at birth (Life.Exp). On the other hand, there is very little correlation between the ratio of females to males in the labour force (Labo.FM) and either Edu.Exp or GNI.


pca_human_no_standard <- prcomp(human)
pca_human_no_standard
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01
##                     PC5           PC6           PC7           PC8
## Edu2.FM   -0.0022935252  2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    0.0022190154  3.264423e-02  7.132267e-01 -7.001533e-01
## Edu.Exp    0.1431180282  9.882477e-01 -3.826887e-02  7.776451e-03
## Life.Exp   0.9865644425 -1.453515e-01  5.380452e-03  2.281723e-03
## GNI       -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    0.0266373214  1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  0.0188618600  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -0.0716401914 -2.309896e-02 -2.642548e-03  2.680113e-03
s_no_standard=summary(pca_human_no_standard)

pca_pr_no_standard <- round(100*s_no_standard$importance[2, ], digits = 1)
pc_lab_no_standard=paste0(names(pca_pr_no_standard), " (", pca_pr_no_standard, "%)")

biplot(pca_human_no_standard, cex = c(0.8, 1), col = c("black", "red"), xlab = pc_lab_no_standard[1], ylab = pc_lab_no_standard[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped



pca_human_standard <- prcomp(human,scale. = TRUE)
pca_human_standard
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
s_standard=summary(pca_human_standard)

pca_pr_standard <- round(100*s_standard$importance[2, ], digits = 1)

pc_lab_standard=paste0(names(pca_pr_standard), " (", pca_pr_standard, "%)")

biplot(pca_human_standard, cex = c(0.8, 1), col = c("black", "red"), xlab = pc_lab_standard[1], ylab = pc_lab_standard[2])


There is a big difference between the PCA results with and without standardizing the data. As explained in the course slides, "PCA is sensitive to the relative scaling of the original features and assumes that features with larger variance are more important than features with smaller variance". Hence, performing PCA on unstandardized variables produces large loadings for the variables with the largest variance. From the summary table we can clearly see that among the eight variables GNI has by far the largest variance, which makes the first principal component depend almost entirely on it: PC1 is dominated by GNI and explains nearly all of the variance (99.99%).
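The scale sensitivity is easy to demonstrate on simulated data (a toy sketch, not the human data; the variable named big plays the role of GNI):

```r
# Tiny demo of PCA's scale sensitivity: one simulated variable with a huge
# variance dominates PC1 unless the data are standardised.
set.seed(1)
d <- data.frame(a = rnorm(100), b = rnorm(100), big = rnorm(100, sd = 1000))

p_raw <- prcomp(d)                  # unstandardised: PC1 is essentially "big"
p_std <- prcomp(d, scale. = TRUE)   # standardised: variance is spread out

summary(p_raw)$importance[2, 1]     # proportion of variance on PC1, near 1
summary(p_std)$importance[2, 1]     # much smaller share on PC1
```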

When the variables are scaled we get a much better representation: the first two principal components together explain about 70% of the variance. The first principal component (PC1) explains 53.6% of the variation, compared to nearly 100% when the data was not scaled, and the second principal component (PC2) explains 16.2%.
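These percentages follow directly from the standard deviations reported by prcomp: each component's share is its variance over the total. A quick check, with the values copied from the standardized output above:

```r
# Standard deviations of the standardised PCA, copied from the output above
sdev <- c(2.0708380, 1.1397204, 0.8750485, 0.7788630,
          0.6619563, 0.5363061, 0.4589994, 0.3222406)

# percentage of variance explained by each principal component
prop_var <- round(100 * sdev^2 / sum(sdev^2), 1)
prop_var   # first two entries: 53.6 and 16.2
```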

From the standardized PCA biplot we can see that four variables, namely Edu.Exp, Life.Exp, GNI and Edu2.FM, are positively correlated, since the angles between their arrows are relatively small; of these, GNI and Edu2.FM have the highest correlation since the angle between them is smallest. Similarly, Parli.F and Labo.FM are positively correlated with each other, as are Mat.Mor and Ado.Birth. Parli.F and Labo.FM are almost orthogonal to the other variables, indicating only a small correlation with them. Furthermore, the plot shows that Life.Exp and Ado.Birth are the least correlated, as they are farthest apart in the plot (notice the large angle between these two arrows). From the biplot we can also see that Parli.F and Labo.FM are the main contributors to the direction of PC2, whereas the other variables contribute mainly to the direction of PC1.
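The angle interpretation can be checked against the rotation matrix: the cosine of the angle between two loading arrows in the PC1-PC2 plane approximates their association. For example, for Mat.Mor and Ado.Birth (loadings copied from the standardized rotation above):

```r
# PC1/PC2 loadings of Mat.Mor and Ado.Birth, copied from the rotation above
mat_mor   <- c(0.43697098, 0.14508727)
ado_birth <- c(0.41126010, 0.07708468)

# cosine of the angle between the two arrows (1 = parallel, 0 = orthogonal)
cos_angle <- sum(mat_mor * ado_birth) /
  (sqrt(sum(mat_mor^2)) * sqrt(sum(ado_birth^2)))
cos_angle   # close to 1: the arrows point in nearly the same direction
```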


The Tea dataset

The tea data used in this week's exercise comes from the R package "FactoMineR". It is a data frame with 300 rows and 36 columns, based on a questionnaire survey of a sample of 300 tea consumers.

data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
#g1<-gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") 
#g1+ geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 

Among the 36 columns (variables), I chose the following:

  1. What kind of tea do you consume most often (black, Earl Grey) Tea
  2. Do you consume tea at home (yes, no) home
  3. Do you add sugar (yes, no) sugar
  4. Do you always drink tea (always, not always) always
  5. Do you drink tea after lunch (yes, no) lunch
  6. Do you think tea is slimming (yes, no) slimming
  7. Does tea have an effect on health (yes, no) health
# select the variables of interest
keep_columns <- c("Tea", "home", "sugar", "always", "lunch", "slimming", "health")
# subset the tea data to the selected columns
tea_my <- dplyr::select(tea, one_of(keep_columns))
## Warning: Unknown variables: `health`
str(tea_my)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea     : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ home    : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sugar   : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ always  : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ lunch   : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ slimming: Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
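The warning above arises because "health" is not actually a column of the tea data (the health-related column is called effect.on.health), so only six of the seven requested variables are kept. Missing names can be spotted beforehand with base R; a sketch with a toy data frame standing in for tea:

```r
# Toy stand-in for the tea data frame: "health" is not among its columns
df <- data.frame(Tea = "black", home = "home", effect.on.health = "no")

wanted <- c("Tea", "home", "health")
setdiff(wanted, names(df))   # requested columns that are missing: "health"
```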
g1<-gather(tea_my) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") 
## Warning: attributes are not identical across measure variables;
## they will be dropped
g1+ geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

summary(tea_my)
##         Tea            home          sugar            always   
##  black    : 74   home    :291   No.sugar:155   always    :103  
##  Earl Grey:193   Not.home:  9   sugar   :145   Not.always:197  
##  green    : 33                                                 
##        lunch            slimming  
##  lunch    : 44   No.slimming:255  
##  Not.lunch:256   slimming   : 45  
## 
# multiple correspondence analysis
mca <- MCA(tea_my, graph = FALSE)
# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_my, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.224   0.190   0.175   0.165   0.149   0.140
## % of var.             19.213  16.293  14.997  14.152  12.759  11.962
## Cumulative % of var.  19.213  35.506  50.503  64.655  77.414  89.376
##                        Dim.7
## Variance               0.124
## % of var.             10.624
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##               Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1          | -0.262  0.102  0.082 | -0.225  0.089  0.060 |  0.321  0.197
## 2          | -0.734  0.801  0.661 | -0.149  0.039  0.027 |  0.379  0.274
## 3          | -0.142  0.030  0.051 | -0.183  0.059  0.084 |  0.092  0.016
## 4          |  0.330  0.162  0.259 | -0.260  0.118  0.160 |  0.034  0.002
## 5          |  0.022  0.001  0.001 |  0.275  0.133  0.120 | -0.395  0.298
## 6          | -0.142  0.030  0.051 | -0.183  0.059  0.084 |  0.092  0.016
## 7          | -0.142  0.030  0.051 | -0.183  0.059  0.084 |  0.092  0.016
## 8          | -0.734  0.801  0.661 | -0.149  0.039  0.027 |  0.379  0.274
## 9          | -0.142  0.030  0.051 | -0.183  0.059  0.084 |  0.092  0.016
## 10         | -0.734  0.801  0.661 | -0.149  0.039  0.027 |  0.379  0.274
##              cos2  
## 1           0.123 |
## 2           0.176 |
## 3           0.021 |
## 4           0.003 |
## 5           0.248 |
## 6           0.021 |
## 7           0.021 |
## 8           0.176 |
## 9           0.021 |
## 10          0.176 |
## 
## Categories (the 10 first)
##                Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black      |  -1.138  23.767   0.424 -11.263 |   0.235   1.190   0.018
## Earl Grey  |   0.543  14.097   0.532  12.607 |   0.143   1.161   0.037
## green      |  -0.622   3.166   0.048  -3.782 |  -1.365  17.969   0.230
## home       |  -0.084   0.515   0.231  -8.306 |   0.028   0.067   0.026
## Not.home   |   2.731  16.640   0.231   8.306 |  -0.908   2.170   0.026
## No.sugar   |  -0.648  16.143   0.449 -11.589 |   0.096   0.420   0.010
## sugar      |   0.693  17.257   0.449  11.589 |  -0.103   0.449   0.010
## always     |   0.306   2.389   0.049   3.825 |   0.787  18.660   0.324
## Not.always |  -0.160   1.249   0.049  -3.825 |  -0.412   9.756   0.324
## lunch      |   0.561   3.430   0.054   4.021 |   0.149   0.287   0.004
##             v.test     Dim.3     ctr    cos2  v.test  
## black        2.321 |   0.714  11.963   0.167   7.060 |
## Earl Grey    3.331 |  -0.008   0.004   0.000  -0.179 |
## green       -8.298 |  -1.555  25.336   0.299  -9.453 |
## home         2.762 |  -0.084   0.655   0.229  -8.276 |
## Not.home    -2.762 |   2.721  21.165   0.229   8.276 |
## No.sugar     1.721 |   0.070   0.243   0.005   1.256 |
## sugar       -1.721 |  -0.075   0.260   0.005  -1.256 |
## always       9.844 |  -0.803  21.099   0.337 -10.043 |
## Not.always  -9.844 |   0.420  11.032   0.337  10.043 |
## lunch        1.071 |   0.612   5.227   0.064   4.385 |
## 
## Categorical variables (eta2)
##              Dim.1 Dim.2 Dim.3  
## Tea        | 0.552 0.232 0.392 |
## home       | 0.231 0.026 0.229 |
## sugar      | 0.449 0.010 0.005 |
## always     | 0.049 0.324 0.337 |
## lunch      | 0.054 0.004 0.064 |
## slimming   | 0.010 0.545 0.022 |

About 50.5% of the variation is captured by the first three dimensions, the first capturing 19.21% of the variance, the second 16.29% and the third 15.00%. The categorical variables table shows the squared correlations (eta2) between the variables and the dimensions. The type of tea has a strong squared correlation with the first dimension (0.552) and moderate squared correlations with dimensions two (0.232) and three (0.392). Slimming has a strong squared correlation with dimension two (0.545) and only small squared correlations with dimensions one (0.010) and three (0.022).
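The cumulative percentages can be verified directly from the eigenvalues in the summary (values copied from the output above; in MCA the eigenvalues sum to the number of categories minus the number of variables, divided by the number of variables):

```r
# Eigenvalues (variances) of the seven MCA dimensions, copied from above
variance <- c(0.224, 0.190, 0.175, 0.165, 0.149, 0.140, 0.124)

pct <- 100 * variance / sum(variance)
cumsum(pct)   # third entry is ~50.5, matching the summary output
```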


# visualize MCA
plot(mca, invisible=c("ind"),habillage = "quali")


The MCA biplot shows how similar the variable categories are to each other: categories plotted close together are associated. For example, the tea type closest to 'sugar' is 'Earl Grey', suggesting that Earl Grey drinkers tend to take sugar with their tea, whereas 'black' and 'green' lie on the opposite side of the first dimension.